Today we’ll need the following R libraries, if you get an error, make sure you have all of them installed with install.packages() function.
library(tidyverse)
library(RColorBrewer) # to create color scales
library(ggrepel) # to crate nice labels for gene names
library(scales) # to look at color we created
Last lecture we looked at point plots and how to plot them based on data.
Today we will look at a customized volcano plot like the one below:
What is a volcano plot?
A volcano plot is a type of scatter-plot that is used to quickly identify changes in large data sets composed of replicate data. It plots significance versus fold-change on the y and x axes, respectively. These plots are increasingly common in omic experiments such as genomics, proteomics, and metabolomics where one often has a list of many thousands of replicate data points (e.g. genes, proteins, metabolites) between two conditions (e.g. normal vs sick) and one wishes to quickly identify the most meaningful changes. In this example we’re looking gene expression (process that converts sequence of a gene in the DNA to messenger RNA).
Dataset we’ll use comes from an experiment where mice were exposed to x-ray (this is also used as a treatment to kill cancer cells) . Then messenger RNA (mRNA) levels were measured from ~27,000 genes in spleen tissue in both control and exposed mice. The table below shows the results from a statistical test comparing these two groups across ~27,000 genes. Conducting the same comparison test with 27,00 genes leads to (multiple comparison (testing) problem)[https://en.wikipedia.org/wiki/Multiple_comparisons_problem]. To overcome this problem and detect the true positives, multiple testing correction was applied to the results, which gives the false discovery rate (FDR).
genes.Spleen.Xray <- read_csv("data/results_miceSpleen_xray.csv")
head(genes.Spleen.Xray)
| GENE | logFC | PValue | FDR |
|---|---|---|---|
| Fam212b | 4.554160 | 0 | 0 |
| Mdm2 | 1.638605 | 0 | 0 |
| Ccng1 | 3.060737 | 0 | 0 |
| Polk | 2.582634 | 0 | 0 |
| Aen | 2.606176 | 0 | 0 |
| Slc19a2 | 2.468342 | 0 | 0 |
This data format is already in appropriate format for our plotting purposes.
We will only add columns:
- logFDR: Negative \(log_{10}\) transformed FDR (false discovery rate) values
- threshold: defining whether a gene qualifies for the set threshold. In this case, it is any gene that satisfies:
- \(FDR < 0.1\) (which is also same as \(logFDR > 1\)) and \(|\ logFC\ | > 0.4\)
We will also pick the top 10 genes and assign as a separate data frame for labels in the plot.
pdat <- genes.Spleen.Xray %>%
mutate(logFDR = -log10(FDR),
threshold = ifelse(FDR < 0.1 & abs(logFC) > .4, TRUE, FALSE) )
#subset top 10 genes to add labels later to our plot
data.label <- pdat %>%
filter(threshold == TRUE) %>%
top_n(n = 10, wt = logFDR)
data.label
| GENE | logFC | PValue | FDR | logFDR | threshold |
|---|---|---|---|---|---|
| Fam212b | 4.554160 | 0 | 0 | 10.962238 | TRUE |
| Mdm2 | 1.638605 | 0 | 0 | 9.834850 | TRUE |
| Ccng1 | 3.060737 | 0 | 0 | 9.834850 | TRUE |
| Polk | 2.582634 | 0 | 0 | 9.834850 | TRUE |
| Aen | 2.606176 | 0 | 0 | 9.741446 | TRUE |
| Slc19a2 | 2.468342 | 0 | 0 | 9.741446 | TRUE |
| Zfp365 | 5.288498 | 0 | 0 | 9.129395 | TRUE |
| Prrg4 | 4.691077 | 0 | 0 | 8.636292 | TRUE |
| Phlda3 | 4.100608 | 0 | 0 | 8.497192 | TRUE |
| Zmat3 | 2.867346 | 0 | 0 | 8.402769 | TRUE |
| Bax | 1.991250 | 0 | 0 | 8.402769 | TRUE |
geom_point layersIn the plot any point below our threshold has different aesthetics mapped to them!! Therefore, we will add two layers of geom_point and provide different datasets and aesthetic mappings to each layer.
One of the differences between these two layers is the shape of the point.
Plotting ‘character’ (pch) There are multiple shapes available for plotting points. Take a look with:
?pch
Back to our plot…
vp <- ggplot(pdat, aes(x = logFC, y = logFDR)) +
geom_point(data = filter(pdat, threshold == FALSE),
size=.35, alpha = .55, colour = "grey80") +
geom_point(data = filter(pdat, threshold == TRUE),
aes(size = logFDR, fill=logFDR),
pch = 21, stroke = .25, alpha = .65, color = "grey59")
vp
?brewer_pal
brewer.pal(6, "PuRd")
## [1] "#F1EEF6" "#D4B9DA" "#C994C7" "#DF65B0" "#DD1C77" "#980043"
display.brewer.pal(6, "PuRd")
myPallete <- colorRampPalette(brewer.pal(8,"PuRd"))
myPallete(10)
## [1] "#F7F4F9" "#EAE5F1" "#DCCAE3" "#D0ACD3" "#CB8EC4" "#DC6AB2" "#E43C96"
## [8] "#DB1E72" "#C00E50" "#91003F"
show_col(myPallete(10))
show_col(myPallete(100), labels = FALSE)
vp <- vp +
scale_fill_gradientn(guide="colourbar", colours = myPallete(100)) +
guides(size=FALSE,
color=FALSE,
fill=guide_colourbar(title = expression(-log[10](FDR)))) +
labs(x = "Fold Change",
y = expression(-log[10](FDR))) +
geom_hline(yintercept=1, color="black", size=1) +
annotate(geom = "text", x=-6, y=0.75, label="italic(FDR) == 0.1",parse=TRUE) +
ggtitle("A Volcano Plot!")
vp
vp <- vp +
geom_label_repel(data = data.label,
aes(label = GENE), alpha = .95,
size = 4, color = "black",
box.padding = 0.5, segment.size = .5,
arrow = arrow(length = unit(0.015, 'npc')))
vp
theme()vp <- vp +
theme_minimal() +
theme(plot.title = element_text(lineheight = 1,face = "bold",size = 20),
axis.text = element_text(size = 12),
axis.title = element_text(size = 16, face = "bold"),
legend.text=element_text(size=13),
legend.position="bottom",
legend.title = element_text(size = 12,face = "bold"),
legend.key.size = unit(0.035, "npc"),
strip.text.x = element_text(size = 16, colour = "black"),
strip.text.y = element_text(size = 16, colour = "black"))
vp
Default is PDF format but many graphic devices are available.
?ggsave
ggsave( "volcano_plot.pdf",
vp, width = 10, height = 8)
ggsave( "volcano_plot.jpeg",
vp, width = 10, height = 8, device = "jpeg")
mpg dataggplot2 dataset, you need to first load tidyverse or ggplot2
Fuel economy data from 1999 and 2008 for 38 popular models of car
?mpg
head(mpg,3)
| manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
|---|---|---|---|---|---|---|---|---|---|---|
| audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
| audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
| audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
mpg$class[1:20]
## [1] "compact" "compact" "compact" "compact" "compact" "compact" "compact"
## [8] "compact" "compact" "compact" "compact" "compact" "compact" "compact"
## [15] "compact" "midsize" "midsize" "midsize" "suv" "suv"
geom_bar is designed to make it easy to create bar charts that show counts. For example number of cars in each class
ggplot(mpg, aes(manufacturer)) +
geom_bar() +
theme(axis.text.x = element_text(size = 10, angle = 45))
To plot sub categories in each group, use aes(fill = x) function. This will let you color the bars. Bar charts are automatically stacked when multiple bars are “defined”" at the same location.
ggplot(mpg, aes(manufacturer)) +
geom_bar(aes(fill = class)) +
theme(axis.text.x = element_text(size = 10, angle = 45))
To have side by side plots for the groups you would like to compare, you need to use the position argument within the geom_ functions. You can either simply assign the name of positioning as a string or assign a position_ function for more details.
ggplot(mpg, aes(manufacturer)) +
geom_bar(aes(fill = class), position = "dodge")
ggplot(mpg, aes(manufacturer)) +
geom_bar(aes(fill = class),position = position_dodge(width = 0.4))
Plotting mean or other statistics of a variable directly with ggplot2 is less straightforward. One way is to use stat_summary_bin function.
?stat_summary_bin
ggplot(mpg, aes(class)) +
stat_summary(aes(y = displ), fun.y = "mean", geom = "bar")
A more straightforward approach is to use geom_col.
geom_col functionIf you would like to plot means or any other values other than the count of categories you need to provide them to ggplot() in the following format:
Assuming that outcome is the mean outcome of treatments (trt) a, b and c here.
df <- data.frame(trt = c("a", "b", "c"), outcome = c(2.3, 1.9, 3.2))
df
| trt | outcome |
|---|---|
| a | 2.3 |
| b | 1.9 |
| c | 3.2 |
Then use geom_col()
ggplot(df, aes(trt, outcome)) +
geom_col()
You can also change colors easily
ggplot(df, aes(trt, outcome)) +
geom_col(fill="steelblue")
ggplot(df, aes(trt, outcome)) +
geom_col(aes(fill=trt))
Another nice trick is to take advantage of group_by and summarise functions.
mpg %>%
group_by(class) %>%
summarise(cty_mean = mean(cty)) %>%
ungroup() %>%
ggplot(aes(class, cty_mean)) +
geom_col()
geom_bar() with continuous dataYou can use geom_bar() with continuous data, in which case it will values and show counts at unique locations. (But geom_histogram is more appropriate for this task.)
ggplot(mpg, aes(displ)) + geom_bar()
Histograms are basically barplots with “bins”. To construct a histogram,
The bins are usually specified as consecutive, non-overlapping intervals of a variable.
ggplot(iris, aes(Sepal.Length)) + geom_histogram(binwidth = 0.2)
ggplot(iris, aes(Sepal.Length, fill=Species)) +
geom_histogram(binwidth = 0.5)
ggplot(iris, aes(Sepal.Length, fill=Species)) +
geom_histogram(binwidth = 0.5, alpha=0.4)
ggplot(iris, aes(Sepal.Length, fill=Species)) +
geom_histogram(binwidth = 0.5, position = "identity", alpha=0.4)
A box plot is a easy way to get some overall idea about the distribution of your variable. Let’s have a quick review on how to read a box plot:
Boxplot (with an interquartile range) and a probability density function (pdf) of a Normal N(0,σ2) Population
ToothGrowth datasethead(ToothGrowth)
| len | supp | dose |
|---|---|---|
| 4.2 | VC | 0.5 |
| 11.5 | VC | 0.5 |
| 7.3 | VC | 0.5 |
| 5.8 | VC | 0.5 |
| 6.4 | VC | 0.5 |
| 10.0 | VC | 0.5 |
?ToothGrowth
Here’s how the the distribution of tooth length for different regimens look like.
gb <- ToothGrowth %>%
unite(regimen, -len, sep= "_") %>%
ggplot(data=., aes(regimen, len)) +
geom_boxplot(aes(color=regimen))
The error bars can define multiple types of variation:
Standard deviation (sd) shows how much each data point deviate from the mean “on average”. This is useful if you would like to have a general idea about how the data is distributed.
Standard error of the mean (sem) gives a value explaining how much your estimated mean is affected by the sampling variability.
95% Confidence interval (ci95) the interval which you are 95% certain that the true mean (or any estimated value/statistic) falls into. This is probably the best type of variation if we are really interested in estimating the mean.
As mentioned earlier, ggplot2 it’s easier to provide calculated summary statistics for plotting, rather than using the stats_ functions.
So, now let’s compute the mean, sd and se using summarise function.
ToothGrowthtg.stats <- ToothGrowth %>%
group_by(supp, dose) %>% # group by
summarise(.,
mean.len = mean(len), # mean
sd.len = sd(len), # sd
n = length(len) # count
) %>%
ungroup() %>%
mutate(se.len = sd.len/sqrt(n))
Now this is how tg.stats look
tg.stats
| supp | dose | mean.len | sd.len | n | se.len |
|---|---|---|---|---|---|
| OJ | 0.5 | 13.23 | 4.459708 | 10 | 1.4102837 |
| OJ | 1.0 | 22.70 | 3.910953 | 10 | 1.2367520 |
| OJ | 2.0 | 26.06 | 2.655058 | 10 | 0.8396031 |
| VC | 0.5 | 7.98 | 2.746634 | 10 | 0.8685620 |
| VC | 1.0 | 16.77 | 2.515309 | 10 | 0.7954104 |
| VC | 2.0 | 26.14 | 4.797731 | 10 | 1.5171757 |
Finally we can start plotting! Let’s start with line graphs.
Step by step:
For points, shape is the same as pch.
ggplot(tg.stats, aes(x=dose, y=mean.len, colour=supp)) +
geom_point(aes(x=dose, y= mean.len), size=4, shape=18)
ggplot(tg.stats, aes(x=dose, y=mean.len, colour=supp)) +
geom_point(aes(x=dose, y= mean.len), size=4, shape=18) +
geom_errorbar(aes(ymin=mean.len-sd.len, ymax=mean.len+sd.len), width=.1)
ggplot(tg.stats, aes(x=dose, y=mean.len, colour=supp)) +
geom_point(aes(x=dose, y= mean.len), size=4, shape=18) +
geom_errorbar(aes(ymin=mean.len-sd.len, ymax=mean.len+sd.len), width=.1) +
geom_line()
Just specify the original data set in another layer of geom_plot. Just like the volcano plot!
ggplot(tg.stats, aes(x=dose, y=mean.len, colour=supp)) +
geom_point(aes(x=dose, y= mean.len), size=4, shape=18) +
geom_errorbar(aes(ymin=mean.len-sd.len, ymax=mean.len+sd.len), width=.1) +
geom_line() +
# here I add the original data points!
geom_point(data=ToothGrowth, aes(x=dose, y=len, fill=supp), alpha=0.4, size=1.5)
Now that we have more or less the plot we want, let’s customize it further.
Modify axis, legend, and plot labels see Reference ggplot2: Scale
ggplot(tg.stats, aes(x=dose, y=mean.len, colour=supp)) +
geom_point(aes(x=dose, y= mean.len), size=4, shape=18) +
geom_errorbar(aes(ymin=mean.len-sd.len, ymax=mean.len+sd.len), width=.1) +
geom_line() +
geom_point(data=ToothGrowth, aes(x=dose, y=len, fill=supp), alpha=0.3, size=1.5) +
# Here I add the titles I want
labs(x="Supplement Dose",
y="Tooth Length",
title= "The Effect of Vitamin C on Tooth Growth in Guinea Pigs",
subtitle= "Comparing delivery methods: Orange Juice vs Ascorbic Acid",
caption = "Based on the data from R")
Define the limits of axes
ggplot(tg.stats, aes(x=dose, y=mean.len, colour=supp)) +
geom_point(aes(x=dose, y= mean.len), size=4, shape=18) +
geom_errorbar(aes(ymin=mean.len-sd.len, ymax=mean.len+sd.len), width=.1) +
geom_line() +
geom_point(data=ToothGrowth, aes(x=dose, y=len, fill=supp), alpha=0.3, size=1.5) +
# Here I add the titles I want
labs(x="Supplement Dose",
y="Tooth Length",
title= "The Effect of Vitamin C on Tooth Growth in Guinea Pigs",
subtitle= "Comparing delivery methods: Orange Juice vs Ascorbic Acid",
caption = "Based on the data from R") +
# Change x and y axes limits
xlim(0, 3) +
ylim(0, 60)
We saw in the previous session how to change legend parameters with guides() function. To change the legend title we can use the same function. See (Reference ggplot2 Guides)[http://ggplot2.tidyverse.org/reference/index.html#section-guides-axes-and-legends]
ggplot(tg.stats, aes(x=dose, y=mean.len, colour=supp)) +
geom_point(aes(x=dose, y= mean.len), size=4, shape=18) +
geom_errorbar(aes(ymin=mean.len-sd.len, ymax=mean.len+sd.len), width=.1) +
geom_line() +
geom_point(data=ToothGrowth, aes(x=dose, y=len, fill=supp), alpha=0.3, size=1.5) +
# Here I add the titles I want
labs(x="Supplement Dose",
y="Tooth Length",
title= "The Effect of Vitamin C on Tooth Growth in Guinea Pigs",
subtitle= "Comparing delivery methods: Orange Juice vs Ascorbic Acid",
caption = "Based on the data from R") +
# Change the legend
guides(colour=guide_legend("Supplement"), fill=guide_legend("Supplement"))
I want to customize the jitter of the points. To adjust positions look at position adjustment in ggplot2 Reference
pd <- position_dodge(0.1)
# my plot is getting very long, I'll assign it
tgp <- ggplot(tg.stats, aes(x=dose, y=mean.len, colour=supp)) +
geom_point(aes(x=dose, y= mean.len), size=4, shape=18, position=pd) +
geom_errorbar(aes(ymin=mean.len-sd.len, ymax=mean.len+sd.len), width=.1, position=pd) +
geom_line(position=pd) +
geom_point(data=ToothGrowth, aes(x=dose, y=len, fill=supp), alpha=0.3, size=1.5, position=pd) +
# Here I add the titles I want
labs(x="Supplement Dose",
y="Tooth Length",
title= "The Effect of Vitamin C on Tooth Growth in Guinea Pigs",
subtitle= "Comparing delivery methods: Orange Juice vs Ascorbic Acid",
caption = "Based on the data from R") +
# Change the legend
guides(colour=guide_legend("Supplement"), fill=guide_legend("Supplement"))
tgp
The dataset we’ll use contains the daily closing prices of major European stock indices: Germany DAX (Ibis), Switzerland SMI, France CAC, and UK FTSE. Be aware that this is a slightly modified version of the EuStockMarkets a dataset that comes with R and the dates in the data was simulated by me and doesn’t represent the actual dates. But it already comes in a friendly date format, so when we import it, R will already now that this columns is in date format.
euStock <- read_csv("data/EUstock_1992-1997.csv")
head(euStock)
| DAX | SMI | CAC | FTSE | date |
|---|---|---|---|---|
| 1577.26 | 1670.1 | 1765.7 | 2493.1 | 1992-01-01 |
| 1577.26 | 1670.1 | 1765.7 | 2493.1 | 1992-01-02 |
| 1598.19 | 1670.1 | 1749.9 | 2492.8 | 1992-01-03 |
| 1604.05 | 1704.0 | 1770.3 | 2504.1 | 1992-01-06 |
| 1604.69 | 1711.8 | 1787.6 | 2493.2 | 1992-01-07 |
| 1593.65 | 1700.5 | 1778.7 | 2482.9 | 1992-01-08 |
As we already know, this dataset can be cleaner. We would rather have all the stock indices listed under one column for plotting.
euStock <- euStock %>%
gather(Stock, Closing, -date)
head(euStock)
| date | Stock | Closing |
|---|---|---|
| 1992-01-01 | DAX | 1577.26 |
| 1992-01-02 | DAX | 1577.26 |
| 1992-01-03 | DAX | 1598.19 |
| 1992-01-06 | DAX | 1604.05 |
| 1992-01-07 | DAX | 1604.69 |
| 1992-01-08 | DAX | 1593.65 |
This looks much better!
Now let’s plot it!
p <- euStock %>%
ggplot(aes(x=date, y=Closing, color = Stock)) +
geom_line()
p
Now we can decide which date format we want to use for plotting. The table below explains how to provide the date format. Click here for more details.
Here some examples:
p+scale_x_date(date_labels = "%b")
p+scale_x_date(date_labels = "%Y %b %d")
p+scale_x_date(date_labels = "%m-%Y")
To see the x-axis, let’s angle out x-axis tick labels so we can see them clearly.
p<- p + theme(axis.text.x=element_text(angle=60, hjust=1, size=8))
p + scale_x_date(date_breaks = "1 month", date_labels = "%b")
p + scale_x_date(date_breaks = "3 month", date_labels = "%b %y")
p + scale_x_date(date_breaks = "3 month", date_labels = "%b %y",
date_minor_breaks = "2 week")
You can also plot a specific period by giving a range with first and last date of the period.
p + scale_x_date(limit=c(as.Date("1994-01-01"),as.Date("1996-01-31")))